Say it in R with "by", "apply" and friends


f <- function(x) x^2
sapply(1:10, f)
[1]   1   4   9  16  25  36  49  64  81 100

Here is an example where we calculate the means of the various measurements of the species of the famous iris data set using by.

by


do.call("rbind", as.list(
  by(iris, list(Species=iris$Species), function(x){
    y <- subset(x, select= -Species)
    apply(y, 2, mean)
  }
)))

           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
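
To see what by hands back before the rbind step, here is a small variation of my own rather than part of the example above. It is only a sketch: it uses colMeans as a shortcut for apply(y, 2, mean) and picks the four numeric columns by position, which assumes the standard column order of iris.

```r
# by() splits the numeric columns by species and applies colMeans to each piece
res <- by(iris[, 1:4], iris$Species, colMeans)
class(res)   # an object of class "by", one element per species

# stack the per-species mean vectors into one matrix
means <- do.call(rbind, res)
means
```

The result should match the table above, with the species as row names.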

Now let’s find alternative ways of expressing ourselves, using other words/functions of the R language, such as aggregate, apply, sapply, tapply, data.table, ddply, sqldf, and summaryBy.

aggregate

The aggregate function splits the data into subsets and computes summary statistics for each of them. The output of aggregate is a data.frame, including a column for species.

iris.x <- subset(iris, select= -Species)
iris.s <- subset(iris, select= Species)
aggregate(iris.x, iris.s, mean)

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
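
aggregate also has a formula interface, which saves creating the two helper data frames. The following is a minimal sketch; the name res is just a placeholder of mine. The dot on the left-hand side stands for all columns not named on the right-hand side.

```r
# "." means: every column except the grouping variable Species
res <- aggregate(. ~ Species, data = iris, FUN = mean)
res
```

This should give the same data.frame as above, with Species as the first column.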

apply and tapply

The combination of tapply and apply achieves a similar result, but this time the output is a matrix and hence we lose the column with the species. The species are now the row names.

apply(iris.x, 2, function(x) tapply(x, iris.s, mean))

           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
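
It may help to look at the inner step on its own. As a small sketch (the variable name sl is mine), tapply applied to a single column returns one mean per species; apply then simply repeats this for every column.

```r
# one column, grouped by the species factor
sl <- tapply(iris$Sepal.Length, iris$Species, mean)
sl
```

The values correspond to the Sepal.Length column of the matrix above.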

split and apply

Here we first split the data into subsets, one for each species, and then calculate the mean for each column in the subset. The output is a matrix again, but transposed.

sapply(split(iris.x, iris.s), function(x) apply(x, 2, mean))

             setosa versicolor virginica
Sepal.Length  5.006      5.936     6.588
Sepal.Width   3.428      2.770     2.974
Petal.Length  1.462      4.260     5.552
Petal.Width   0.246      1.326     2.026
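
If you prefer the species as rows, a call to t() flips the matrix back. This is only a sketch of one possible variation: it uses colMeans in place of apply(x, 2, mean) and selects the numeric columns by position, so it assumes the standard layout of iris.

```r
# split by species, take column means, then transpose so species are rows
m <- t(sapply(split(iris[, 1:4], iris$Species), colMeans))
m
```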

ddply

Hadley Wickham’s plyr package provides tools for splitting, applying and combining data. The function ddply is similar to the by function, but it returns a data.frame instead of a by list and maintains the column for the species.

library(plyr)
ddply(iris, "Species", function(x){
    y <- subset(x, select= -Species)
    apply(y, 2, mean)
  })

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

doBy

The summaryBy function of the doBy package by Søren Højsgaard and Ulrich Halekoh has a very intuitive interface, using formulas.

library(doBy)
summaryBy(Sepal.Length + Sepal.Width + Petal.Length + Petal.Width ~ Species, data=iris, FUN=mean)

     Species Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean
1     setosa             5.006            3.428             1.462            0.246
2 versicolor             5.936            2.770             4.260            1.326
3  virginica             6.588            2.974             5.552            2.026

sqldf

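If you are fluent in SQL, then the sqldf package by Gabor Grothendieck might be the one for you. Note that the query refers to the columns with underscores instead of dots, because the dot is an operator in SQL. As far as I know, newer versions of sqldf and RSQLite no longer translate the dots, in which case you would quote the original names instead, e.g. "Sepal.Length".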

library(sqldf)
sqldf("select Species, avg(Sepal_Length), avg(Sepal_Width), 
    avg(Petal_Length), avg(Petal_Width) from iris 
    group by Species")

     Species avg(Sepal_Length) avg(Sepal_Width) avg(Petal_Length) avg(Petal_Width)
1     setosa             5.006            3.428             1.462            0.246
2 versicolor             5.936            2.770             4.260            1.326
3  virginica             6.588            2.974             5.552            2.026

data.table

The data.table package by M Dowle, T Short and S Lianoglou is the underground rock star to me. It provides an elegant and fast way to complete our task. The statement reads in plain English from right to left: take columns 1 to 4, split them by the factor in column “Species” and calculate on the sub data (.SD) the means.

library(data.table)
iris.dt <- data.table(iris)
iris.dt[,lapply(.SD,mean),by="Species",.SDcols=1:4]

        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]     setosa        5.006       3.428        1.462       0.246
[2,] versicolor        5.936       2.770        4.260       1.326
[3,]  virginica        6.588       2.974        5.552       2.026
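
As far as I understand it, .SD leaves out the grouping column by default, so for this data set the .SDcols argument can be dropped. Treat the following as a sketch rather than the recommended form; the name res is a placeholder of mine, and recent versions of data.table print the result slightly differently from the output shown above.

```r
library(data.table)

iris.dt <- data.table(iris)
# .SD holds all columns except the "by" column, here the four measurements
res <- iris.dt[, lapply(.SD, mean), by = Species]
res
```

Keeping .SDcols is still the safer choice when a table contains further non-numeric columns.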

apply

I should mention that R also provides the iris data set in array form. The third dimension of the iris3 array holds the species information. Therefore we can use the apply function again: we go down the third and then the second dimension to calculate the means.

apply(iris3, c(3,2), mean)

           Sepal L. Sepal W. Petal L. Petal W.
Setosa        5.006    3.428    1.462    0.246
Versicolor    5.936    2.770    4.260    1.326
Virginica     6.588    2.974    5.552    2.026
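
A quick look at the structure of iris3 shows why the margins 3 and 2 are the right ones. The sketch below also illustrates that the order of the margins only decides the orientation of the result; the names a and b are placeholders of mine.

```r
dim(iris3)             # 50 observations x 4 measurements x 3 species
dimnames(iris3)[[3]]   # labels of the third dimension, i.e. the species

a <- apply(iris3, c(3, 2), mean)   # species as rows
b <- apply(iris3, c(2, 3), mean)   # measurements as rows
all.equal(a, t(b))                 # same numbers, transposed layout
```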

Conclusion

Many roads lead to Rome, and there are endless ways of explaining how to get there. I only showed a few I know of, and I am curious to hear yours.

As a matter of courtesy I should mention the unknownR package by Matthew Dowle. It helps you to discover what you don’t know that you don’t know in R. Thus, it can help to build your R vocabulary.

Of course there is a key difference between R and English. R tells me right away when I make a mistake. Human readers are far more forgiving, but please do point out to me where I made mistakes. I am still hopeful that I can improve, but I need your help.